Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve performance competitive with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data.
Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition
Compared to conventional artificial neurons that produce dense and
real-valued responses, biologically-inspired spiking neurons transmit sparse
and binary information, which can also lead to energy-efficient
implementations. Recent research has shown that spiking neural networks can be
trained like standard recurrent neural networks using the surrogate gradient
method. They have shown promising results on speech command recognition tasks.
Using the same technique, we show that they are scalable to large vocabulary
continuous speech recognition, where they are capable of replacing LSTMs in the
encoder with only minor loss of performance. This suggests that they may be
applicable to more involved sequence-to-sequence tasks. Moreover, in contrast
to their recurrent non-spiking counterparts, they are robust to the exploding
gradient problem without the need for gates.
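The surrogate gradient method keeps the binary, non-differentiable spike in the forward pass but substitutes a smooth pseudo-derivative in the backward pass. A numpy sketch of a leaky integrate-and-fire neuron and a fast-sigmoid surrogate (parameters such as `beta` and `slope` are illustrative, not the paper's settings):

```python
import numpy as np

def lif_forward(inputs, beta=0.9, threshold=1.0):
    """Simulate a leaky integrate-and-fire neuron over time.

    Emits binary spikes; the membrane potential is reset by
    subtraction whenever a spike fires.
    """
    v, spikes = 0.0, []
    for x in inputs:
        v = beta * v + x                     # leaky integration
        s = 1.0 if v >= threshold else 0.0   # non-differentiable step
        v -= s * threshold                   # soft reset
        spikes.append(s)
    return np.array(spikes)

def surrogate_grad(v, threshold=1.0, slope=10.0):
    """Fast-sigmoid surrogate for the step function's derivative,
    used in the backward pass in place of the true gradient, which
    is zero almost everywhere."""
    return 1.0 / (slope * np.abs(v - threshold) + 1.0) ** 2

spikes = lif_forward([0.6, 0.6, 0.6, 0.0])
assert set(spikes).issubset({0.0, 1.0})      # spikes stay binary
```

Because the forward output is strictly binary, the recurrent state cannot blow up multiplicatively, which is the intuition behind the robustness to exploding gradients noted above.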
A t-distribution based operator for enhancing out of distribution robustness of neural network classifiers
Neural Network (NN) classifiers can assign extreme probabilities to samples
that have not appeared during training (out-of-distribution samples) resulting
in erroneous and unreliable predictions. One of the causes for this unwanted
behaviour lies in the use of the standard softmax operator which pushes the
posterior probabilities to be either zero or unity hence failing to model
uncertainty. The statistical derivation of the softmax operator relies on the
assumption that the distributions of the latent variables for a given class are
Gaussian with known variance. However, it is possible to use different
assumptions in the same derivation and attain from other families of
distributions as well. This allows derivation of novel operators with more
favourable properties. Here, a novel operator is proposed that is derived using
t-distributions, which are capable of providing a better description of
uncertainty. It is shown that classifiers that adopt this novel operator can be
more robust to out of distribution samples, often outperforming NNs that use
the standard softmax operator. These enhancements can be reached with minimal
changes to the NN architecture.
Comment: 5 pages, 5 figures; to be published in IEEE Signal Processing
Letters; reproducible code: https://github.com/idiap/tsoftma
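The core idea is that a heavy-tailed (Student-t) kernel decays polynomially rather than exponentially, so far-away samples keep non-negligible probability mass. A sketch of this contrast (this is an illustration of the principle, not necessarily the paper's exact operator or derivation; see the linked repository for the real one):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def t_operator(d2, nu=1.0):
    """Class posteriors from a Student-t kernel over squared distances
    d2 to each class centre; heavier tails than the Gaussian kernel
    underlying the standard softmax."""
    w = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return w / w.sum()

d2 = np.array([0.0, 50.0])      # a sample far from the second class
p_gauss = softmax(-0.5 * d2)    # Gaussian assumption: nearly [1, 0]
p_t = t_operator(d2)            # heavy tails: visibly less extreme
assert p_t[1] > p_gauss[1]      # the t-operator retains uncertainty
```

For the out-of-distribution sample above, the Gaussian-derived softmax collapses to near-certainty while the t-kernel keeps roughly two percent of the mass on the distant class, which is the behaviour the abstract argues improves robustness.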
Can ChatGPT Detect Intent? Evaluating Large Language Models for Spoken Language Understanding
Recently, large pretrained language models have demonstrated strong language
understanding capabilities. This is particularly reflected in their zero-shot
and in-context learning abilities on downstream tasks through prompting. To
assess their impact on spoken language understanding (SLU), we evaluate several
such models, including ChatGPT and OPT of different sizes, on multiple
benchmarks. We verify an emergent ability unique to the largest models: given
oracle transcripts, they reach intent classification accuracy close to that of
supervised models with zero or few shots across various languages. By
contrast, the results for smaller models that fit on a single GPU fall far
behind. We note that
the error cases often arise from the annotation scheme of the dataset;
responses from ChatGPT are still reasonable. We show, however, that the model
is worse at slot filling, and its performance is sensitive to ASR errors,
suggesting serious challenges for applying such text-based models to SLU.
Comment: 6 pages, 2 figures; accepted by Interspeech 202
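Zero-shot intent classification with an LLM reduces to asking the model to pick one label from a closed set for each transcript. A minimal prompt-construction sketch (the label set, utterance, and wording are hypothetical, not taken from the paper's benchmarks):

```python
# Hypothetical intent inventory for illustration only.
INTENTS = ["play_music", "set_alarm", "get_weather"]

def build_prompt(utterance, intents=INTENTS):
    """Build a zero-shot prompt asking an LLM to pick exactly one
    intent label for an (oracle) transcript."""
    options = ", ".join(intents)
    return (f"Classify the intent of the utterance below.\n"
            f"Choose exactly one of: {options}.\n"
            f'Utterance: "{utterance}"\nIntent:')

prompt = build_prompt("wake me up at seven tomorrow")
assert "set_alarm" in prompt and prompt.endswith("Intent:")
```

Few-shot variants simply prepend labelled examples to the same template; the abstract's point is that only the largest models turn such prompts into accuracy competitive with supervised classifiers.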
A Bayesian Approach to Recurrence in Neural Networks
We begin by reiterating that common neural network activation functions have
simple Bayesian origins. In this spirit, we go on to show that Bayes's theorem
also implies a simple recurrence relation; this leads to a Bayesian recurrent
unit with a prescribed feedback formulation. We show that introduction of a
context indicator leads to a variable feedback that is similar to the forget
mechanism in conventional recurrent units. A similar approach leads to a
probabilistic input gate. The Bayesian formulation leads naturally to the two
pass algorithm of the Kalman smoother or forward-backward algorithm, meaning
that inference naturally depends upon future inputs as well as past ones.
Experiments on speech recognition confirm that the resulting architecture can
perform as well as a bidirectional recurrent network with the same number of
parameters as a unidirectional one. Further, when configured explicitly
bidirectionally, the architecture can exceed the performance of a conventional
bidirectional recurrence.
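The recurrence at the heart of this construction is just Bayes's theorem applied frame by frame: the posterior from time t-1 becomes the prior at time t, and normalisation bounds the recurrent state. A numpy sketch of that forward recursion (a sketch of the underlying recursion only, not the paper's full recurrent unit with its context indicator and input gate):

```python
import numpy as np

def bayes_recurrence(likelihoods, prior):
    """Forward pass: p(class | x_1..t) ∝ p(x_t | class) p(class | x_1..t-1).

    Normalising after every step keeps the recurrent state on the
    simplex, giving a bounded feedback without any explicit gate.
    """
    post = np.asarray(prior, dtype=float)
    history = []
    for lik in likelihoods:
        post = lik * post
        post /= post.sum()      # normalisation bounds the state
        history.append(post.copy())
    return history

liks = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
posts = bayes_recurrence(liks, prior=[0.5, 0.5])
assert abs(posts[-1].sum() - 1.0) < 1e-12
assert posts[-1][0] > posts[0][0]   # evidence accumulates over time
```

Running the same recursion backwards over future frames and combining the two passes is what connects this view to the forward-backward algorithm mentioned in the abstract.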
Ad Hoc Microphone Array Calibration: Euclidean Distance Matrix Completion Algorithm and Theoretical Guarantees
This paper addresses the problem of ad hoc microphone array calibration where
only partial information about the distances between microphones is available.
We construct a matrix of the pairwise distances and propose to estimate the
missing entries with a novel Euclidean distance matrix completion algorithm
that alternates low-rank matrix completion with projection onto the Euclidean
distance space. This approach confines the recovered matrix
to the EDM cone at each iteration of the matrix completion algorithm. The
theoretical guarantees of the calibration performance are obtained considering
the random and locally structured missing entries as well as the measurement
noise on the known distances. This study elucidates the links between the
calibration error and the number of microphones along with the noise level and
the ratio of missing distances. Thorough experiments on real data recordings
and simulated setups are conducted to demonstrate these theoretical insights. A
significant improvement is achieved by the proposed Euclidean distance matrix
completion algorithm over the state-of-the-art techniques for ad hoc microphone
array calibration.
Comment: In press, available online, August 1, 2014;
http://www.sciencedirect.com/science/article/pii/S0165168414003508; Signal
Processing, 201
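The alternating scheme can be sketched in a few lines: a low-rank step exploits the fact that a squared EDM of points in d dimensions has rank at most d + 2, and a projection step enforces simple necessary conditions of the EDM cone. The following is a simplified illustration under those assumptions, not the paper's exact algorithm or its guarantees:

```python
import numpy as np

def edm_complete(D, mask, rank=4, iters=200):
    """Alternate a low-rank approximation with a projection toward the
    EDM cone (symmetric, zero diagonal, non-negative).

    D: observed squared-distance matrix; mask: True where D is known.
    """
    X = np.where(mask, D, D[mask].mean())    # crude initialisation
    for _ in range(iters):
        # low-rank step: squared EDMs of points in R^d have rank <= d + 2
        U, s, Vt = np.linalg.svd(X)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # projection toward simple necessary conditions of the EDM cone
        X = np.maximum((X + X.T) / 2.0, 0.0)
        np.fill_diagonal(X, 0.0)
        X[mask] = D[mask]                    # keep the known distances
    return X

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 2))              # 8 microphones in the plane
D = ((P[:, None] - P[None]) ** 2).sum(-1)    # true squared EDM
mask = rng.random((8, 8)) < 0.7
mask = mask & mask.T                         # symmetric known-entry pattern
np.fill_diagonal(mask, True)
X = edm_complete(D, mask)
```

The recovered matrix stays symmetric with a zero diagonal and reproduces the known entries exactly; the real algorithm's projection onto the full EDM cone is what yields the theoretical guarantees discussed in the abstract.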
Silence Models in Weighted Finite-State Transducers
We investigate the effects of different silence modelling strategies in Weighted Finite-State Transducers for Automatic Speech Recognition. We show that the choice of silence models, and the way they are included in the transducer, can have a significant effect on the size of the resulting transducer; we present a means to prevent particularly large silence overheads. We conclude that context-free silence modelling fits well with transducer-based grammars, whereas modelling silence as a monophone and as a context incurs larger overheads.
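The size argument can be made concrete with a toy lexicon transducer: context-free optional silence can live on a single shared loop at the word-boundary state, whereas attaching silence separately per word duplicates arcs. A hypothetical sketch (arcs as plain tuples; real systems use weighted FSTs, e.g. OpenFst, and the overheads in the paper arise from context expansion, which this toy does not model):

```python
def lexicon_arcs(words, shared_silence=True):
    """Build arcs (src, dst, in_label, out_label) for a toy lexicon
    transducer with state 0 as the shared word-boundary state."""
    arcs, state = [], 1
    for word, phones in words.items():
        prev = 0
        for i, ph in enumerate(phones):
            out = word if i == 0 else "<eps>"
            arcs.append((prev, state, ph, out))
            prev, state = state, state + 1
        arcs.append((prev, 0, "<eps>", "<eps>"))   # back to the boundary
    if shared_silence:
        arcs.append((0, 0, "sil", "<eps>"))        # one shared silence loop
    else:
        # naive alternative: one silence arc duplicated per word
        arcs += [(0, 0, "sil", "<eps>") for _ in words]
    return arcs

words = {"a": ["ax"], "the": ["dh", "ax"]}
assert len(lexicon_arcs(words)) < len(lexicon_arcs(words, shared_silence=False))
```

Even in this two-word toy the shared loop saves arcs; with context-dependent silence the duplication multiplies across contexts, which is the overhead the abstract warns about.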
A Weighted Finite State Transducer tutorial
The concepts of WFSTs are summarised, including structural and stochastic optimisations. A typical composition process for ASR is described. Some experiments show that care should be taken with silence models.
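Composition, the workhorse operation of WFST-based ASR, matches the output labels of one transducer against the input labels of another, pairing states and adding weights in the tropical semiring. A sketch of the textbook construction for epsilon-free machines (dict-of-arcs representation and labels are illustrative; real toolkits handle epsilon transitions with filters):

```python
from collections import defaultdict

def compose(A, B):
    """Compose two epsilon-free weighted transducers in the tropical
    semiring: path weights add, so matched arc weights sum.

    A, B: dicts mapping state -> list of (in, out, weight, next_state).
    Composed states are pairs (state_in_A, state_in_B).
    """
    start, arcs = (0, 0), defaultdict(list)
    stack, seen = [start], {start}
    while stack:
        qa, qb = stack.pop()
        for (ia, oa, wa, na) in A.get(qa, []):
            for (ib, ob, wb, nb) in B.get(qb, []):
                if oa == ib:                 # A's output feeds B's input
                    nxt = (na, nb)
                    arcs[(qa, qb)].append((ia, ob, wa + wb, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return dict(arcs)

A = {0: [("a", "A", 0.5, 1)]}    # maps "a" to "A" with weight 0.5
B = {0: [("A", "A", 1.0, 1)]}    # rescores "A" with weight 1.0
C = compose(A, B)
assert C[(0, 0)] == [("a", "A", 1.5, (1, 1))]
```

In an ASR cascade the same operation chains HMM, context, lexicon, and grammar transducers; the tutorial's structural optimisations (determinisation, minimisation) then shrink the result.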
Cepstral normalisation and the signal to noise ratio spectrum in automatic speech recognition.
Cepstral normalisation in automatic speech recognition is investigated in the context of robustness to additive noise. It is argued that such normalisation leads naturally to a speech feature based on signal-to-noise ratio rather than absolute energy (or power). Explicit calculation of this SNR-cepstrum by means of a noise estimate is shown to have theoretical and practical advantages over the usual (energy-based) cepstrum. The SNR-cepstrum is shown to be almost identical to the articulation index known in psycho-acoustics. Combination of the SNR-cepstrum with the well-known perceptual linear prediction method is shown to be beneficial in noisy environments.
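Computationally, the SNR-cepstrum replaces the log power spectrum with the log of a per-bin signal-to-noise ratio before the usual DCT. A numpy sketch of that pipeline (the flooring constant and toy spectra are illustrative; the paper's noise estimator and filterbank stages are omitted):

```python
import numpy as np

def snr_cepstrum(power_spec, noise_est, floor=1e-10):
    """Cepstrum of the per-bin SNR instead of absolute power: dividing
    by a noise estimate plays the role that cepstral mean subtraction
    plays for the conventional energy-based cepstrum."""
    snr = np.maximum(power_spec / np.maximum(noise_est, floor), floor)
    log_snr = np.log(snr)
    N = len(log_snr)
    k, n = np.arange(N)[:, None], np.arange(N)[None, :]
    basis = np.cos(np.pi * (n + 0.5) * k / N)     # DCT-II basis
    return basis @ log_snr

spec = np.array([4.0, 2.0, 1.0, 1.0])   # toy per-bin power spectrum
noise = np.ones(4)                      # toy flat noise estimate
cep = snr_cepstrum(spec, noise)
assert cep.shape == (4,)
```

Bins at or below the noise floor contribute log-SNR near zero rather than a large negative log energy, which is one source of the robustness claimed in the abstract.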
SNR Features for Automatic Speech Recognition
When combined with cepstral normalisation techniques, the features normally used in Automatic Speech Recognition are based on Signal-to-Noise Ratio (SNR). We show that calculating SNR from the outset, rather than relying on cepstral normalisation to produce it, gives features with a number of practical and mathematical advantages over power-spectrum-based ones. In a detailed analysis, we derive Maximum Likelihood and Maximum a-Posteriori estimates for SNR-based features, and show that they can outperform more conventional ones, especially when subsequently combined with cepstral variance normalisation. We further show anecdotal evidence that SNR-based features lend themselves well to noise estimates based on low-energy envelope tracking.
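Low-energy envelope tracking estimates the noise floor from the quietest recent frames in each frequency bin. A minimum-statistics-style sketch of the idea (window length and the toy signal are hypothetical; the paper's actual tracker may differ):

```python
import numpy as np

def low_energy_noise(frames, win=10):
    """Track the low-energy envelope of per-bin power over time: the
    minimum over a sliding window of frames approximates the noise
    floor, since speech is intermittent but noise is persistent.

    frames: (T, F) power spectra; returns (T, F) noise estimates.
    """
    T = len(frames)
    return np.array([frames[max(0, t - win + 1):t + 1].min(axis=0)
                     for t in range(T)])

# noise floor 1.0 with a short speech burst of extra power
frames = np.ones((20, 3))
frames[5:8] += 9.0
noise = low_energy_noise(frames)
assert np.allclose(noise[10], 1.0)   # the window spans a quiet frame
```

As long as the window is longer than typical speech bursts, the per-bin minimum stays on the noise floor, giving the SNR features above a denominator that needs no explicit voice-activity detection.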